Content

Introduction

Analysis

1. Geographical distribution of diabetes prevalence and availability of grocery

2. Relationship between grocery contents and diabetes prevalence

3. Purchasing pattern in relationship to diabetes prevalence

4. Identification of community profiles with high risk of diabetes

5. Identification of communities with high risk of diabetes

Conclusion

Appendix

Reference

Introduction

Diabetes prevalence in London has been increasing, and 6.7% of London's population is now diagnosed with the condition (Diabetes UK, 2021). This calls for collective action to reduce diabetes. Since there is scientific evidence of a relationship between diabetes and lifestyle habits, campaigns targeting those habits have the potential to reduce diabetes prevalence. To help such campaigns use resources efficiently, this report explores the relationship between diabetes prevalence and dietary habits, and then identifies communities with a higher risk of diabetes.

Report aim: To facilitate the formulation of strategies to decrease diabetes prevalence by identifying communities with a higher risk of diabetes.

Research question: What is the distribution of diabetes prevalence in London, and what underlying factors account for this distribution?

Structure

This report is structured as follows to provide a systematic review and analysis of diabetes prevalence in London and its relevant attributes. I started on a broad scale and narrowed the scope based on the results of each analysis, so that the significance of each step is well supported. I began with the impact of the mere availability of groceries on diabetes prevalence, then deepened the analysis to the nutrient content of groceries. After establishing the presumed relationship between diabetes and dietary habits, I moved on to identify, with machine learning, community profiles with a high risk of diabetes. The report finishes with the identification of communities with a high risk of diabetes.

Set up

#set up working directory
setwd("~/Desktop/DEP5111/Asm/Asm2")
#load libraries
library(tidyverse)
library(ggthemes)
library(scales)
library(sf)
library(tmap)
library(knitr)
library(GGally)
library(RColorBrewer)
library(viridis)
library(wesanderson)
library(ggridges)
library(stringr)
library(caret)
library(lares)
library(plotly)
theme_set(theme_clean())

Data Exploration

#import data
grocery <- read_csv("Data/Area-level grocery purchases/year_osward_grocery.csv")
diabetes <- read_csv("Data/Validation data (obesity, diabetes)/diabetes_estimates_osward_2016.csv")
grocery_borough <- read_csv("Data/Area-level grocery purchases/year_borough_grocery.csv")
diabetes_borough <- read_csv("Data/Validation data (obesity, diabetes)/london_obesity_borough_2012.csv")
demographic_ward <- read_csv("Data/ward-profiles.csv")
head(grocery, 5)
## # A tibble: 5 × 202
##   area_id   weight weight_perc2.5 weight_perc25 weight_perc50 weight_perc75
##   <chr>      <dbl>          <dbl>         <dbl>         <dbl>         <dbl>
## 1 E05000026   450.           32.5          166.           300           500
## 2 E05000027   413.           32.5          150            300           500
## 3 E05000028   407.           32.5          160            300           500
## 4 E05000029   384.           30            150            250           454
## 5 E05000030   357.           30            140            250           450
## # ℹ 196 more variables: weight_perc97.5 <dbl>, weight_std <dbl>,
## #   weight_ci95 <dbl>, volume <dbl>, volume_perc2.5 <dbl>, volume_perc25 <dbl>,
## #   volume_perc50 <dbl>, volume_perc75 <dbl>, volume_perc97.5 <dbl>,
## #   volume_std <dbl>, volume_ci95 <dbl>, fat <dbl>, fat_perc2.5 <dbl>,
## #   fat_perc25 <dbl>, fat_perc50 <dbl>, fat_perc75 <dbl>, fat_perc97.5 <dbl>,
## #   fat_std <dbl>, fat_ci95 <dbl>, saturate <dbl>, saturate_perc2.5 <dbl>,
## #   saturate_perc25 <dbl>, saturate_perc50 <dbl>, saturate_perc75 <dbl>, …
head(diabetes, 5)
## # A tibble: 5 × 4
##   area_id   gp_patients gp_patients_diabetes estimated_diabetes_prevalence
##   <chr>           <dbl>                <dbl>                         <dbl>
## 1 E05000026       13136                 1068                           8.1
## 2 E05000027        8954                  631                           7  
## 3 E05000028       12032                  958                           8  
## 4 E05000029        8853                  700                           7.9
## 5 E05000030        8813                  640                           7.3
summary(diabetes)
##    area_id           gp_patients    gp_patients_diabetes
##  Length:637         Min.   :  164   Min.   :   5.0      
##  Class :character   1st Qu.: 9990   1st Qu.: 547.0      
##  Mode  :character   Median :12067   Median : 736.0      
##                     Mean   :12047   Mean   : 773.8      
##                     3rd Qu.:14003   3rd Qu.: 961.0      
##                     Max.   :24981   Max.   :2157.0      
##  estimated_diabetes_prevalence
##  Min.   : 2.000               
##  1st Qu.: 5.100               
##  Median : 6.200               
##  Mean   : 6.359               
##  3rd Qu.: 7.500               
##  Max.   :12.600

1. Geographical distribution of diabetes prevalence and availability of grocery

To start with, I establish whether diabetes prevalence in London shows any spatial autocorrelation, so that exploring its geographical pattern is justified. I therefore visualise it with a choropleth map. On top of the map, I layer the distribution of Tesco stores in London to understand whether the mere availability of groceries has an impact on diabetes prevalence.
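Spatial autocorrelation is assessed visually in this report. A statistic such as Moran's I could quantify it. The sketch below is illustrative only: the function name `morans_i`, the four-area adjacency matrix `w` and the prevalence values are made up for the example and are not part of the analysis.

```r
# Moran's I: (n / S0) * sum_ij w_ij * z_i * z_j / sum_i z_i^2,
# where z = x - mean(x) and S0 is the sum of all weights.
morans_i <- function(x, w) {
  z <- x - mean(x)
  n <- length(x)
  s0 <- sum(w)
  (n / s0) * sum(w * outer(z, z)) / sum(z^2)
}

# Hypothetical example: four areas in a row (1-2-3-4), neighbours share a border
w <- matrix(0, nrow = 4, ncol = 4)
w[cbind(1:3, 2:4)] <- 1
w[cbind(2:4, 1:3)] <- 1

clustered   <- c(8, 8, 3, 3)  # similar values sit next to each other
alternating <- c(8, 3, 8, 3)  # neighbours always differ

i_clustered   <- morans_i(clustered, w)    # positive: similar values cluster
i_alternating <- morans_i(alternating, w)  # negative: neighbours are dissimilar
```

Values above zero point to clustering of similar values, and values below zero point to neighbouring areas being dissimilar. A real application would need an adjacency matrix built from the ward polygons and a significance test.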

Data cleaning

In this part, I acquired the distribution of supermarket stores in the UK from Geolytix (https://geolytix.com/blog/supermarket-retail-points/) and the borough boundaries from the London Data Store (https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london).

I imported the csv files into Python and generated point geometry from the longitude and latitude columns. Since the analysis focuses on Tesco and London, I filtered out irrelevant data. I then joined the filtered dataframe with the London borough boundary data to assign a borough to each store.
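The cleaning itself was done in Python (see Appendix). For readers working only in R, the same three steps could be sketched with sf roughly as below. This is not the original code: the toy stores, the two square "boroughs" and the helper `square()` are hypothetical stand-ins for the Geolytix and boundary data. Only the column names `retailer`, `long_wgs`, `lat_wgs` and `NAME` are taken from the data shown in this report.

```r
library(sf)

# Hypothetical store table with longitude/latitude columns
stores <- data.frame(
  retailer = c("Tesco", "Tesco", "Other"),
  long_wgs = c(0.5, 1.5, 0.5),
  lat_wgs  = c(0.5, 0.5, 0.5)
)

# Step 1: build point geometry from the coordinate columns
stores_sf <- st_as_sf(stores, coords = c("long_wgs", "lat_wgs"), crs = 4326)

# Step 2: keep only the retailer of interest
tesco <- stores_sf[stores_sf$retailer == "Tesco", ]

# Two hypothetical unit-square areas standing in for borough polygons
square <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}
boroughs <- st_sf(
  NAME = c("A", "B"),
  geometry = st_sfc(square(0), square(1), crs = 4326)
)

# Step 3: spatial join, attaching to each store the area it falls within
joined <- st_join(tesco, boroughs, join = st_within)
joined$NAME
```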

#Data of tesco store location in London
tesco_point <- read_csv("Data/Tesco location/tesco_point.csv")
head(tesco_point, 5)
## # A tibble: 5 × 14
##    ...1      gluid retailer fascia store_name   address_one address_two locality
##   <dbl>      <dbl> <chr>    <chr>  <chr>        <chr>       <chr>       <chr>   
## 1  3852 1010004041 Tesco    Tesco  Bromley Sup… Homesdale … Bromley     Widmore 
## 2  3853 1010004042 Tesco    Tesco  Beckenham E… Croydon Ro… Beckenham   Elmers …
## 3  3921 1010004113 Tesco    Tesco  Thornton He… 32 Brigsto… Thornton H… Thornto…
## 4  3933 1010004125 Tesco    Tesco  Sidcup Supe… Edgington … Sidcup      Foots C…
## 5  3934 1010004126 Tesco    Tesco  Welling Sup… Welling Hi… <NA>        Welling 
## # ℹ 6 more variables: long_wgs <dbl>, lat_wgs <dbl>, geometry <chr>,
## #   ward <chr>, GSS_CODE <chr>, HECTARES <dbl>
#Data of London ward and borough boundaries
ward_gdf <- read_csv("Data/statistical-gis-boundaries-london/London_ward.csv")
borough_gdf <- read_csv("Data/statistical-gis-boundaries-london/London_borough.csv")

head(ward_gdf, 5)
## # A tibble: 5 × 9
##    ...1 NAME     GSS_CODE HECTARES NONLD_AREA LB_GSS_CD BOROUGH POLY_ID geometry
##   <dbl> <chr>    <chr>       <dbl>      <dbl> <chr>     <chr>     <dbl> <chr>   
## 1     0 Chessin… E050004…     755.          0 E09000021 Kingst…   50840 POLYGON…
## 2     1 Tolwort… E050004…     259.          0 E09000021 Kingst…  117160 POLYGON…
## 3     2 Berryla… E050004…     145.          0 E09000021 Kingst…   50449 POLYGON…
## 4     3 Alexand… E050004…     269.          0 E09000021 Kingst…   50456 POLYGON…
## 5     4 Beverley E050004…     188.          0 E09000021 Kingst…  117161 POLYGON…
head(borough_gdf, 5)
## # A tibble: 5 × 9
##    ...1 NAME   GSS_CODE HECTARES NONLD_AREA ONS_INNER SUB_2009 SUB_2006 geometry
##   <dbl> <chr>  <chr>       <dbl>      <dbl> <lgl>     <lgl>    <lgl>    <chr>   
## 1     0 Kings… E090000…    3726.        0   FALSE     NA       NA       POLYGON…
## 2     1 Croyd… E090000…    8649.        0   FALSE     NA       NA       POLYGON…
## 3     2 Broml… E090000…   15013.        0   FALSE     NA       NA       POLYGON…
## 4     3 Houns… E090000…    5659.       60.8 FALSE     NA       NA       POLYGON…
## 5     4 Ealing E090000…    5554.        0   FALSE     NA       NA       POLYGON…
#converting the geometry columns from character to polygons, so that tmap can read it
ward_sf <- st_as_sf(ward_gdf, wkt = 'geometry')
ward_sf%>%
  relocate(NAME) -> ward_sf

head(ward_sf, 5)
## Simple feature collection with 5 features and 8 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -0.3307564 ymin: 51.32632 xmax: -0.244683 ymax: 51.41226
## CRS:           NA
## # A tibble: 5 × 9
##   NAME               ...1 GSS_CODE HECTARES NONLD_AREA LB_GSS_CD BOROUGH POLY_ID
##   <chr>             <dbl> <chr>       <dbl>      <dbl> <chr>     <chr>     <dbl>
## 1 Chessington South     0 E050004…     755.          0 E09000021 Kingst…   50840
## 2 Tolworth and Hoo…     1 E050004…     259.          0 E09000021 Kingst…  117160
## 3 Berrylands            2 E050004…     145.          0 E09000021 Kingst…   50449
## 4 Alexandra             3 E050004…     269.          0 E09000021 Kingst…   50456
## 5 Beverley              4 E050004…     188.          0 E09000021 Kingst…  117161
## # ℹ 1 more variable: geometry <POLYGON>
tesco_point%>%
  relocate(store_name) -> tesco_tm_prep
tesco_tm <- st_as_sf(tesco_tm_prep, coords = c("long_wgs", "lat_wgs"), crs = 4326)
#joining datasets
left_join(ward_sf, diabetes, by = c("GSS_CODE" = "area_id")) -> diabetes_gdf_temp

head(diabetes_gdf_temp)
## Simple feature collection with 6 features and 11 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -0.3307564 ymin: 51.32632 xmax: -0.244683 ymax: 51.43729
## CRS:           NA
## # A tibble: 6 × 12
##   NAME               ...1 GSS_CODE HECTARES NONLD_AREA LB_GSS_CD BOROUGH POLY_ID
##   <chr>             <dbl> <chr>       <dbl>      <dbl> <chr>     <chr>     <dbl>
## 1 Chessington South     0 E050004…     755.          0 E09000021 Kingst…   50840
## 2 Tolworth and Hoo…     1 E050004…     259.          0 E09000021 Kingst…  117160
## 3 Berrylands            2 E050004…     145.          0 E09000021 Kingst…   50449
## 4 Alexandra             3 E050004…     269.          0 E09000021 Kingst…   50456
## 5 Beverley              4 E050004…     188.          0 E09000021 Kingst…  117161
## 6 Coombe Hill           5 E050004…     442.          0 E09000021 Kingst…  117159
## # ℹ 4 more variables: geometry <POLYGON>, gp_patients <dbl>,
## #   gp_patients_diabetes <dbl>, estimated_diabetes_prevalence <dbl>
#selecting relevant columns
diabetes_gdf_temp%>%
  select(NAME, GSS_CODE, HECTARES, gp_patients, gp_patients_diabetes, estimated_diabetes_prevalence, geometry) -> diabetes_sf
tm_diabetes_prev <- tm_shape(diabetes_sf)+
  tm_borders(col = "black")+
  tm_polygons(col = "estimated_diabetes_prevalence", 
              popup.vars = c("Estimated diabetes prevalence" = "estimated_diabetes_prevalence"), 
              title = "Estimated diabetes prevalence", 
              palette = "Blues")+
  tm_shape(tesco_tm)+
  tm_dots(col = "orange", 
          size = 0.15, 
          popup.vars=c("Ward"="ward", "Address" = "address_one"), 
          clustering = TRUE)+
  tm_layout(
    title = "Diabetes prevalence in London with respect to Tesco store distribution"
  )

tmap_leaflet(tm_diabetes_prev)

Insight

Despite the missing data, the following insights can be drawn.

Firstly, spatial autocorrelation appears to exist for both diabetes prevalence and Tesco store distribution. Clusters of wards with high and with low diabetes prevalence can be observed in the choropleth map. Diabetes prevalence is higher in the outskirts of London and lower in central London. Tesco stores, by contrast, are most concentrated in central London and gradually thin out towards the outskirts.

Secondly, the geographical distribution of Tesco stores has an inverse spatial relationship with the distribution of diabetes prevalence. It should not, however, be concluded that Tesco stores decrease diabetes prevalence. In the following, I will explore the various factors that contribute to this distribution of diabetes prevalence.

2. Relationship between grocery contents and diabetes prevalence

I then conduct a simple data exploration to understand the relationship between the contents of groceries and diabetes prevalence, given that the availability of groceries alone cannot be concluded to affect diabetes prevalence. Specifically, I would like to understand the relationship between nutrient content and diabetes prevalence.

#joining data
full_join(grocery, diabetes, by = "area_id") -> grocery_diabetes
head(grocery_diabetes, 5)
## # A tibble: 5 × 205
##   area_id   weight weight_perc2.5 weight_perc25 weight_perc50 weight_perc75
##   <chr>      <dbl>          <dbl>         <dbl>         <dbl>         <dbl>
## 1 E05000026   450.           32.5          166.           300           500
## 2 E05000027   413.           32.5          150            300           500
## 3 E05000028   407.           32.5          160            300           500
## 4 E05000029   384.           30            150            250           454
## 5 E05000030   357.           30            140            250           450
## # ℹ 199 more variables: weight_perc97.5 <dbl>, weight_std <dbl>,
## #   weight_ci95 <dbl>, volume <dbl>, volume_perc2.5 <dbl>, volume_perc25 <dbl>,
## #   volume_perc50 <dbl>, volume_perc75 <dbl>, volume_perc97.5 <dbl>,
## #   volume_std <dbl>, volume_ci95 <dbl>, fat <dbl>, fat_perc2.5 <dbl>,
## #   fat_perc25 <dbl>, fat_perc50 <dbl>, fat_perc75 <dbl>, fat_perc97.5 <dbl>,
## #   fat_std <dbl>, fat_ci95 <dbl>, saturate <dbl>, saturate_perc2.5 <dbl>,
## #   saturate_perc25 <dbl>, saturate_perc50 <dbl>, saturate_perc75 <dbl>, …
#Selecting different nutrients to compare with the diabetes prevalence
grocery_diabetes%>%
  select(fat, saturate, salt, sugar, protein, estimated_diabetes_prevalence) -> gro_dia.corr
ggpairs(gro_dia.corr, 
        columnLabels = gsub('_', ' ', colnames(gro_dia.corr), fixed = TRUE),
        labeller = label_wrap_gen(10))+
  labs(
    title = "Relationship between Grocery Nutrient Content and Diabetes Prevalence",
    caption = "Data from Aiello et al. (2020)"
  )

Insight

From the above plot, the nutrient contents generally show a correlation with diabetes prevalence: fat, saturated fat, salt and sugar are positively correlated, while protein is negatively correlated. Among them, the weight of sugar in the average product has the highest positive correlation with diabetes prevalence in the ward. Therefore, I decided to move forward with the relationship between sugar content in different wards and estimated diabetes prevalence.
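The coefficients reported in the upper panels of a pairs plot of this kind are, by default, Pearson correlation coefficients. As a reminder of what that number measures, the sketch below computes it by hand and checks it against base R's `cor()`. The helper `pearson_r` and the five pairs of values are made up for illustration and are not taken from the dataset.

```r
# Pearson's r: covariance of x and y scaled by both standard deviations
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

# Hypothetical ward-level values (not real data)
sugar      <- c(9.5, 10.1, 10.8, 11.2, 12.0)  # grams of sugar per product
prevalence <- c(5.1, 6.4, 6.0, 7.5, 8.2)      # estimated prevalence (%)

r_manual  <- pearson_r(sugar, prevalence)
r_builtin <- cor(sugar, prevalence)  # method = "pearson" is the default
```

A coefficient close to 1 or -1 indicates a strong linear association, and a value near 0 indicates a weak one. It describes association only, not causation.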

3. Purchasing pattern in relationship to diabetes prevalence

After understanding the correlation between nutrients and diabetes prevalence, I proceed to explore the purchase patterns in different areas of London. In this part I want to learn whether the purchasing pattern of groceries varies across London, and whether any such variation has an impact on diabetes prevalence.

Purchasing pattern with regard to nutrients

As there are over 600 wards in London, showing the purchase pattern for every ward would be overwhelming. Therefore, I decided to step back to the borough level and visualise the purchase pattern for each nutrient, in order to establish a relationship between purchase pattern and diabetes prevalence.

#stacked bar chart
full_join(borough_gdf, grocery_borough, by = c("GSS_CODE" = "area_id"))%>%
  select(NAME, fat, saturate, salt, sugar, protein)%>%
  #changing the data from wide to long
  pivot_longer(!NAME, names_to = "nutrient")%>%
  ggplot(aes(y=reorder(NAME, value), x=value, fill=nutrient, label = value))+
  geom_col()+
  labs(
  title = "Purchase Pattern in Each Borough in London",
  x = "weight of nutrients (gram)",
  y = "borough",
  caption = "Data from Aiello et al. (2020) and Greater London Authority (2020)")+
  scale_fill_manual(values = wes_palette("Royal2", n = 5))+
  theme(legend.text = element_text(size = 8), legend.position = "top", legend.justification = "left")

Insight

While there is slight variation in purchase patterns between boroughs, they share the following trend: sugar and fat have the highest average weights in a product in all boroughs, followed by protein, then saturated fat, and finally salt. As established, both sugar and fat have a positive correlation with diabetes prevalence, so the fact that these two nutrients account for the most weight in an average product poses a potential concern. The nutrient composition of groceries is generally consistent across boroughs. There is nevertheless observable variation in sugar weight between boroughs, indicating a variation in dietary habits. It is therefore worthwhile to continue the analysis to understand the impact of dietary habits on diabetes prevalence.

Relationship between sugar contents and diabetes prevalence

Since sugar has the highest correlation and the highest weight, I decided to deepen my analysis of its relationship to diabetes prevalence.

As the dataset from Aiello et al. (2020) does not include the average diabetes prevalence at borough level, I first aggregated the estimated diabetes prevalence of each ward to borough level outside the R environment.
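The aggregation was carried out outside R and its exact method is not recorded here. The sketch below shows two plausible ways it could be done in base R, under stated assumptions: the four wards, the `borough` column and all values are made up, while `gp_patients` and `gp_patients_diabetes` follow the column names of the ward-level data.

```r
# Hypothetical ward-level data for two boroughs (not real figures)
wards <- data.frame(
  borough              = c("A", "A", "B", "B"),
  gp_patients          = c(10000, 5000, 8000, 12000),
  gp_patients_diabetes = c(800, 300, 400, 900)
)
wards$prevalence <- 100 * wards$gp_patients_diabetes / wards$gp_patients

# Option 1: simple mean of ward prevalences (every ward counts equally)
simple <- aggregate(prevalence ~ borough, data = wards, FUN = mean)

# Option 2: patient-weighted prevalence (sum the counts, then divide)
totals <- aggregate(
  cbind(gp_patients, gp_patients_diabetes) ~ borough,
  data = wards, FUN = sum
)
totals$prevalence_weighted <-
  100 * totals$gp_patients_diabetes / totals$gp_patients
```

The two options differ whenever wards within a borough have different patient counts. The weighted version reflects the borough population as a whole, whereas the simple mean treats small and large wards alike.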

# importing borough-level diabetes prevalence, aggregated outside R from the ward-level data (diabetes_sf)
diabetes_prev_borough <- read_csv("Data/diabetes_borough.csv")

head(diabetes_prev_borough, 5)
## # A tibble: 5 × 5
##    ...1 gss_code  name                 geometry           estimated_diabetes_p…¹
##   <dbl> <chr>     <chr>                <chr>                               <dbl>
## 1     0 E09000001 City of London       POLYGON ((-0.1115…                  NA   
## 2     1 E09000002 Barking and Dagenham MULTIPOLYGON (((0…                   7.54
## 3     2 E09000003 Barnet               POLYGON ((-0.1995…                   6.11
## 4     3 E09000004 Bexley               POLYGON ((0.12455…                   6.92
## 5     4 E09000005 Brent                POLYGON ((-0.1968…                   8.67
## # ℹ abbreviated name: ¹​estimated_diabetes_prevalence
#Circular bar chart with lollipop plot
right_join(grocery_borough, diabetes_prev_borough, by = c("area_id" = "gss_code"))%>%
  select(name, sugar, protein, weight, estimated_diabetes_prevalence)%>%
  filter(!is.na(estimated_diabetes_prevalence) & estimated_diabetes_prevalence > 0)%>%
  group_by(name, estimated_diabetes_prevalence)%>%
  mutate(sugar_prop = sugar/weight*100, protein_prop = protein/weight*100)%>%
  ggplot()+
  geom_hline(
    aes(yintercept = y), 
    data.frame(y = c(0:4) * 2.5),
    color = "lightgrey"
  ) +
  geom_col(
    aes(x = reorder(str_wrap(name, 5), estimated_diabetes_prevalence), y = estimated_diabetes_prevalence, fill=estimated_diabetes_prevalence),
    position = "dodge2",
    show.legend = TRUE,
    alpha = 0.9,
  )+
  scale_fill_gradientn(
    "diabetes prevalence (%)",
    colours = c( "#F8B195","#F67280","#C06C84", "#6C5B7B")
  )+
  geom_segment(
    aes(
      x = reorder(str_wrap(name, 5), estimated_diabetes_prevalence),
      y = 0,
      xend = reorder(str_wrap(name, 5), estimated_diabetes_prevalence),
      yend = 10
    ),
    linetype = "dashed",
    color = "gray93"
  )+
  geom_segment(
    aes(
      x = reorder(str_wrap(name, 5), estimated_diabetes_prevalence),
      y = 0,
      xend = reorder(str_wrap(name, 5), estimated_diabetes_prevalence),
      yend = sugar_prop
    ),
    color = "gray12",
    size = 1
  )+
  geom_point(
    aes(x = reorder(str_wrap(name, 5), estimated_diabetes_prevalence), y = sugar_prop, color = sugar_prop),
    size = 3,
    )+
  scale_color_gradientn("sugar weight (%)",
                        colours = c("#f0efeb", "#99c1de", "#000000"))+
  coord_polar()+
  labs(
    title = "Diabetes Prevalence in Each Borough in Relationship to \nAverage Sugar Content in a Grocery Product", 
    caption = "Data from Aiello et al. (2020)"
  )+
  annotate(
    x = 6.9, 
    y = 1.5,
    label = "sugar \nproportion",
    geom = "text",
    angle = 7,
    color = "gray12",
    size = 2.3,
    lineheight = 1.1
  )+
  annotate(
    x = 7, 
    y = 6.3,
    label = "estimated \n diabetes \nprevalence",
    geom = "text",
    angle = -83,
    color = "gray12",
    size = 2.7,
    lineheight = 0.9
  ) +
  annotate(
    x = 1.5, 
    y = 5.0, 
    label = "5.0%", 
    geom = "text", 
    color = "gray12",
    size = 3,
    angle = -12
  ) +
  annotate(
    x = 1.5, 
    y =7.5, 
    label = "7.5%", 
    geom = "text", 
    color = "gray12",
    size = 3,
    angle = -12
  )+
  annotate(
    x = 1.5, 
    y =10, 
    label = "10.0%", 
    geom = "text", 
    color = "gray12",
    size = 3,
    angle = -12
  )+
  scale_y_continuous(
    limits = c(-1.5, 11),
    expand = c(0, 0),
    breaks = c(0, 1000, 2000, 3000)
  ) +
  guides(
    color = guide_colorsteps(
     barwidth = 15, barheight = .5, title.position = "top", title.hjust = .5 
    ),
    fill = guide_colorsteps(
      barwidth = 15, barheight = .5, title.position = "top", title.hjust = .5
    )
  ) +
  theme(
    axis.title = element_blank(),
    axis.ticks = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(color = "gray12", size = 9, vjust = 2),
    panel.grid = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.text = element_text(size = 8), 
    legend.position = "top", 
    legend.justification = "left",
  )

Insight

From this plot, the variation in diabetes prevalence is more observable than the variation in the proportion of sugar in the average product. The general trend is that diabetes prevalence increases as the proportion of sugar increases, but there are multiple exceptions. For instance, Sutton, the borough with the highest average proportion of sugar in a grocery product, does not have a high diabetes prevalence. Likewise, the average proportion of sugar in Hammersmith and Fulham is similar to that in Harrow, yet the two boroughs sit at opposite ends of the diabetes prevalence range. Thus, this plot, while supporting a relationship between sugar content and diabetes prevalence, also indicates the limitations of sugar content in explaining diabetes prevalence.

4. Identification of community profiles with high risk of diabetes

I proceeded with the exploration of sugar content as it has a positive correlation with diabetes prevalence (0.467), although the strength of the relationship differs between boroughs.

As the previous plot shows, the average sugar content of a grocery product differs between areas of London. In this part, I want to identify communities with a higher risk of diabetes through over-consumption of sugar by comparing sugar content with community profiles.

I retrieved the demographic data from the London Data Store (https://data.london.gov.uk/dataset/ward-profiles-and-atlas).

#ward profile dataframe
head(demographic_ward, 5)
## # A tibble: 5 × 67
##   `Ward name`   `Old code` `New code` `Population - 2015` Children aged 0-15 -…¹
##   <chr>         <chr>      <chr>                    <dbl>                  <dbl>
## 1 City of Lond… 00AA       E09000001                 8100                    650
## 2 Barking and … 00ABFX     E05000026                14750                   3850
## 3 Barking and … 00ABFY     E05000027                10600                   2700
## 4 Barking and … 00ABFZ     E05000028                12700                   3200
## 5 Barking and … 00ABGA     E05000029                10400                   2550
## # ℹ abbreviated name: ¹​`Children aged 0-15 - 2015`
## # ℹ 62 more variables: `Working-age (16-64) - 2015` <dbl>,
## #   `Older people aged 65+ - 2015` <dbl>,
## #   `% All Children aged 0-15 - 2015` <dbl>,
## #   `% All Working-age (16-64) - 2015` <dbl>,
## #   `% All Older people aged 65+ - 2015` <dbl>, `Mean Age - 2013` <dbl>,
## #   `Median Age - 2013` <dbl>, `Area - Square Kilometres` <dbl>, …

Considering that there are many columns in the dataset, I decided to clean and transform the data first.

# selecting and renaming columns
demographic_ward%>%
  select(`Ward name`, `New code`, `Mean Age - 2013`, `% BAME - 2011`, `Employment rate (16-64) - 2011`, `Median House Price (£) - 2014`)%>%
  rename(name = `Ward name`, code = `New code`, age_mean = `Mean Age - 2013`, BAME_perc = `% BAME - 2011`, employment_perc = `Employment rate (16-64) - 2011`, house_pri_median = `Median House Price (£) - 2014`) -> cleaned_demographic_ward

The selected columns represent the following aspects: age, ethnicity, employment and economic power.

I then trained a machine learning model to understand the relationship between demographic profile and purchasing pattern.

# joining data
left_join(grocery, cleaned_demographic_ward, by = c("area_id" = "code"))%>%
  select(sugar, age_mean, BAME_perc, employment_perc, house_pri_median) -> gro_demo

# filtering out data with null value
gro_demo <- na.omit(gro_demo)
# data partition
vi <- createDataPartition(gro_demo$sugar, p=0.80, list=FALSE)
# grouping training and testing data
training <- gro_demo[vi,]
test <- gro_demo[-vi,]
#training the model (with no method specified, caret's train() defaults to a random forest, method = "rf")
m <- train(sugar ~ ., data = training, importance = FALSE)
# Adding the predicted sugar content result in the testing data
test$predicted <- predict(m, test)
postResample(pred = test$predicted, obs=test$sugar)
##      RMSE  Rsquared       MAE 
## 0.7172419 0.4486609 0.5696833

This model shows a certain amount of deviation of the predicted values from the actual values, as demonstrated by the root mean square error (RMSE) and the mean absolute error (MAE). The R-squared value shows that 44.9% of the variation in the average weight of sugar in a product across wards can be explained by this model. Therefore, it is still useful to consider this model despite its limitations.
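To make the three metrics concrete, the sketch below recomputes them by hand in base R on four made-up observation and prediction pairs (not the model's actual output). It follows the definitions caret's postResample() uses for regression, where R-squared is the squared correlation between predicted and observed values.

```r
# Hypothetical observed and predicted sugar weights (grams per product)
obs  <- c(10, 12, 9, 11)
pred <- c(10.5, 11, 9.5, 11.5)

err <- pred - obs

rmse <- sqrt(mean(err^2))   # root mean square error: penalises large errors more
mae  <- mean(abs(err))      # mean absolute error: average size of the errors
rsq  <- cor(pred, obs)^2    # squared correlation of predictions and observations
```

RMSE and MAE are expressed in the unit of the outcome (grams of sugar), so they should be read against the spread of sugar values across wards. R-squared is unitless and lies between 0 and 1.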

var_imp <- varImp(m)
variable_names <- rownames(var_imp$importance)
new_var_names <- c("Mean age", "Proportion of BAME", "Proportion of employed population", "Median housing price")
rownames(var_imp$importance) <- new_var_names 
ggplot(var_imp)+
  labs(
    title = "Most Relevant Variables",
    caption = "BAME refers to Black, Asian and Minority Ethnic \n\n Data from Aiello et al. (2020) and Greater London Authority (2013)"
  )

Since the median housing price has the highest importance, I decided to move on with this aspect.

ggplot(gro_demo, aes(x = sugar, y = house_pri_median))+
  geom_hex()+
  labs(
    title = "Relationship between Median Housing Price and Average Sugar Content in Each Ward",
    x = "sugar weight (g)",
    y = "median housing price (GBP)",
    caption = "Data from Aiello et al. (2020) and Greater London Authority (2013)")+
  theme(legend.text = element_text(size = 8), 
        legend.position = "top", 
        legend.justification = "left")+
  scale_fill_gradientn(
    "count of wards",
    colours = c( "#F8B195","#F67280","#C06C84", "#6C5B7B")
  )+
  scale_y_continuous(labels = comma)+
  guides(
    fill = guide_colorsteps(
      barwidth = 15, barheight = .5, title.position = "top", title.hjust = .5
    )
  )

Insight

Based on the above hexagonal bin plot, it can be seen that the higher the median housing price is, the lower the average sugar content of a product is. This suggests that people with higher economic power are more likely to make purchasing choices that lower their risk of diabetes. In contrast, people with lower economic power may be at greater risk of diabetes because of the higher sugar content of the products they buy.

The government can use this understanding to identify communities and populations with a greater risk of diabetes due to overconsumption of sugar, and to implement campaigns accordingly.

As a final step of this report, I identify the wards with the lowest 25% housing price in London and visualise them in a map.

5. Identification of communities with high risk of diabetes

# joining data to sf
full_join(ward_sf, demographic_ward, by=c("GSS_CODE" = "New code"))%>%
  select(NAME, BOROUGH, geometry, `Median House Price (£) - 2014`)%>%
  rename(Name = NAME, Borough = BOROUGH, house_pri_median = `Median House Price (£) - 2014`) ->demo_ward_sf
demo_ward_sf$lq <- ifelse(
  is.na(demo_ward_sf$house_pri_median),
  "Missing",
  ifelse(
    demo_ward_sf$house_pri_median <= quantile(demo_ward_sf$house_pri_median, 0.25, na.rm = TRUE),
    "Lowest 25%",
    "Outside Lowest 25%"
  )
)

#omitting rows without spatial data
demo_ward_sf%>%
  filter(!is.na(Name)) -> demo_ward_sf

head(demo_ward_sf)
## Simple feature collection with 6 features and 4 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -0.3307564 ymin: 51.32632 xmax: -0.244683 ymax: 51.43729
## CRS:           NA
## # A tibble: 6 × 5
##   Name                  Borough                  geometry house_pri_median lq   
##   <chr>                 <chr>                   <POLYGON>            <dbl> <chr>
## 1 Chessington South     Kingst… ((-0.330679 51.32901, -0…           315000 Outs…
## 2 Tolworth and Hook Ri… Kingst… ((-0.3084572 51.37586, -…           337195 Outs…
## 3 Berrylands            Kingst… ((-0.3038496 51.39249, -…           361125 Outs…
## 4 Alexandra             Kingst… ((-0.2699001 51.38845, -…           404975 Outs…
## 5 Beverley              Kingst… ((-0.246622 51.39921, -0…           435000 Outs…
## 6 Coombe Hill           Kingst… ((-0.2471369 51.40958, -…           480000 Outs…
#colours from the Wes Anderson "Royal1" palette, written out by hand so that
#their order matches the categories (Lowest 25%, Missing, Outside Lowest 25%)
house_pri_pal <- c("#C93312", "#899DA4", "#FAEFD1")

tm_house_pri <- tm_shape(demo_ward_sf)+
  tm_borders(col = "black")+
  tm_polygons(col = "lq", 
              popup.vars = c("Ward" = "Name", "Borough", "Median housing price (GBP)" = "house_pri_median"), 
              title = "Median housing price", 
              palette = house_pri_pal)+
  tm_layout(
    title = "Wards with Lowest 25% Median Housing Price"
  )

tmap_leaflet(tm_house_pri)

Insight

Highlighted in red are the wards in the lowest 25% of median housing prices, all of which lie in the outskirts of London. This coincides with the map of diabetes prevalence in London.

This map, together with the machine learning model, serves as a tool for the authorities to identify communities in which to implement strategies that decrease diabetes prevalence by addressing sugar consumption.

Conclusion

I endeavoured to facilitate the decrease of diabetes prevalence in London through data analysis and visualisation, by understanding the factors behind diabetes prevalence from the aspects of grocery nutrients and community profiles. In the process, I developed a toolkit for identifying communities with a high risk of diabetes. A machine learning model was developed to predict the average weight of sugar in a purchased product based on age, ethnicity, employment and economic power. A map indicating the wards currently at high risk of diabetes is also included.

Limitations

Several inherent limitations are observed in this analysis.

1. Limited report scale

Since this report is limited in scale and length, several supporting analyses were not conducted to establish the significance of the findings. For instance, no statistical analysis, such as Moran's I or regression analysis, was carried out. The weight of nutrient content is not compared against the energy provided. Food categories, which would provide a more intuitive understanding of purchasing patterns, are also outside the scope.

2. Data age

The data used in this report vary in age due to data availability, although I have endeavoured to select datasets from 2014 to 2015. Moreover, the grocery and diabetes prevalence data both date from 2015, almost ten years ago, so their relevance may have decreased.

Recommendation

Below I summarise the insights from this analysis, which provide a foundation for future research.

1. Purchasing pattern has a significant relationship with diabetes prevalence

In the initial data exploration, all nutrient contents of a grocery product show a correlation with diabetes prevalence. While this report focuses on sugar weight, both fat and saturated fat are worth exploring, as they both have a medium correlation with diabetes prevalence. Future research may focus on behavioural changes in grocery purchasing patterns.

2. Certain communities have higher risks of diabetes due to dietary habits

This report suggests that communities buying products with a higher sugar content have a higher risk of diabetes, despite exceptions, as sugar is not the sole factor behind diabetes prevalence. Therefore, to ensure the efficiency of actions to decrease diabetes prevalence, it is possible to focus on certain communities first. A community-scale action, rather than a national campaign, for instance, is more appropriate for reaching target audiences.

3. Potential empirical example of spurious associations

At the beginning of the report, Tesco store distribution shows an inverse spatial relationship with diabetes prevalence. By the end, it is deduced that wards with lower economic power tend to buy grocery products with more sugar on average, which potentially leads to higher diabetes prevalence. In the last map, wards in the lowest 25% of median housing prices show a distribution similar to that of communities with high diabetes prevalence, as both cluster in the outskirts of London. It is possible that low economic power also contributes to the lower concentration of Tesco stores, though without further research this is merely a hypothesis at this stage. This indicates that spatial correlation does not imply causation, and it serves as a warning for future research.

Appendix

Python file of data cleaning process https://colab.research.google.com/drive/1DyZs0ZhCT-_F2lEAV5gg4rhpbGfYMAQP?usp=sharing

Reference

Aiello, Luca Maria; Schifanella, Rossano; Quercia, Daniele; Del Prete, Lucia (2020). Tesco Grocery 1.0. figshare. Collection. https://doi.org/10.6084/m9.figshare.c.4769354.v2

Diabetes UK (2021). Diabetes Hits Almost 600,000 in London. https://www.diabetes.org.uk/in_your_area/london/london-region-news-/prevalance#:~:text=New%20analysis%20shows%20that%20the,–%206.7%25%20of%20the%20population.

Geolytix (2023). Supermarket Retail Points. https://geolytix.com/blog/supermarket-retail-points/

Greater London Authority (2020). Statistical GIS Boundary Files for London. London Data Store. https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london

Greater London Authority (2013). Ward Profiles and Atlas. London Data Store. https://data.london.gov.uk/dataset/ward-profiles-and-atlas